Sains Malaysiana 53(7)(2024): 1715-1728
http://doi.org/10.17576/jsm-2024-5307-18
Machine Learning for Mapping and Forecasting Poverty in North Sumatera: A Data-Driven Approach
(Pembelajaran Mesin untuk Pemetaan dan Ramalan Kemiskinan di Sumatera Utara: Pendekatan Dipacu Data)
ARNITA*, FARIDAWATY MARPAUNG, FANNY RAMADHANI & DEWAN DINATA
Department of Mathematics, Universitas Negeri Medan, Jl. Williem Iskandar Pasar V, Medan, Indonesia
Received: 20 August 2023/Accepted: 13 May 2024
Abstract
Poverty is a crucial issue because it affects many facets of society, including socioeconomic disparity, crime, and limited access to high-quality education. North Sumatra is one of the provinces with the highest poverty rates in Indonesia. Reducing poverty effectively requires accurate, well-targeted data. Poverty mapping and prediction were therefore conducted for North Sumatra using machine learning (ML) to obtain a precise spatial distribution of poverty, a poverty model, and forecasts. Poverty mapping was conducted using the K-Means algorithm and poverty prediction was conducted using a random forest (RF) algorithm. In the mapping, the elbow graph showed that the inertia value was still declining sharply at three and four clusters. The silhouette index was higher for three clusters (0.313) than for four clusters (0.244). Thus, three poverty clusters (low, medium, and high) were used in the model. For prediction, grid search cross-validation selected the best RF model, with the following parameters: number of estimators = 50, maximum depth = 10, minimum samples per split = 2, and minimum samples per leaf = 1. The mean squared error (MSE) of the RF model's predictions was 0.002617, indicating satisfactory precision.
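The workflow summarised in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' code: it assumes scikit-learn, uses synthetic data in place of the North Sumatra indicators, and the min-max scaling step, the candidate cluster counts, the train/test split, and the parameter grid are all assumptions chosen for the example (the grid merely includes the values reported above). The helper names `scan_cluster_counts` and `tune_random_forest` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, silhouette_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler


def scan_cluster_counts(X, k_values=(2, 3, 4, 5), seed=0):
    """Fit K-Means for each candidate k; return {k: (inertia, silhouette)}."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # Inertia feeds the elbow graph; the silhouette index compares candidates.
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results


def tune_random_forest(X, y, seed=0):
    """Grid search with 5-fold cross-validation; return (best_params, test MSE)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Assumed grid: it contains the reported best values, not the authors' full grid.
    param_grid = {
        "n_estimators": [25, 50],
        "max_depth": [5, 10],
        "min_samples_split": [2, 5],
        "min_samples_leaf": [1, 2],
    }
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid,
        cv=5,
        scoring="neg_mean_squared_error",
    )
    search.fit(X_train, y_train)
    # GridSearchCV refits the best model on the training set by default.
    test_mse = mean_squared_error(y_test, search.predict(X_test))
    return search.best_params_, test_mse


# Synthetic stand-in data (assumption): 120 regions, 4 indicators, 1 target.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(120, 4))
y = 0.3 * X_raw[:, 0] - 0.2 * X_raw[:, 1] + rng.normal(scale=0.05, size=120)
X = MinMaxScaler().fit_transform(X_raw)  # scaling step is an assumption

scores = scan_cluster_counts(X)
best_k = max(scores, key=lambda k: scores[k][1])  # highest silhouette index
best_params, test_mse = tune_random_forest(X, y)
print("silhouette-preferred k:", best_k)
print("best RF parameters:", best_params)
print("test MSE:", test_mse)
```

On real data, the inertia values would be plotted against k to read the elbow, with the silhouette index used to decide between neighbouring candidates, as the abstract describes for three versus four clusters.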
Keywords: Cross-validation; grid search; K-Means; poverty; random forest regression
Abstrak
Isu kemiskinan merupakan isu penting untuk dibincangkan kerana kemiskinan mempengaruhi pelbagai aspek kehidupan seperti jurang sosio-ekonomi, jenayah serta akses yang terhad kepada pendidikan berkualiti. Sumatera Utara merupakan salah satu daripada 5 wilayah teratas dengan jumlah kemiskinan tertinggi di Indonesia. Suatu strategi diperlukan untuk mendapatkan maklumat kemiskinan yang tepat supaya pengurusan kemiskinan disasarkan dan berkesan. Oleh itu, pemetaan dan ramalan kemiskinan dijalankan bagi mendapatkan maklumat yang lebih terperinci tentang taburan reruang kemiskinan dan apakah model kemiskinan di Sumatera Utara. Pendekatan yang diambil untuk memetakan dan meramalkan kemiskinan di Sumatera Utara ialah dengan menggunakan pembelajaran mesin (ML). Pemetaan kemiskinan dijalankan dengan menggunakan algoritma K-Means, manakala ramalan kemiskinan dijalankan menggunakan algoritma hutan rawak (RF). Hasil yang diperoleh daripada pemetaan kemiskinan di Wilayah Sumatera Utara jika dilihat daripada graf siku menunjukkan graf tersebut masih mengalami penurunan nilai inersia yang mendadak pada kelompok ke-3 dan ke-4. Manakala jika dilihat dari nilai indeks Siluet, kelompok ke-3 adalah lebih tinggi daripada kelompok ke-4 dengan nilai indeks Siluet masing-masing adalah 0.313 dan 0.244. Maka dapat disimpulkan bahawa kluster kemiskinan yang digunakan ialah 3 dengan label rendah, sederhana dan tinggi. Manakala, hasil ramalan menggunakan algoritma hutan rawak dengan teknik keesahan silang carian grid memperoleh model terbaik dengan parameter n penganggar = 50, kedalaman maks = 10, min pecahan sampel = 2 dan min sampel daun = 1. Peramalan model RF menghasilkan ketepatan tinggi yang mencukupi dan Min Ralat Kuasa Dua (MSE) ialah 0.002617.
Kata kunci: Carian grid; keesahan silang; kemiskinan; K-Means; regresi hutan rawak
*Corresponding author; email: arnita@unimed.ac.id